Summarized by Aili
Generative Verifiers: Reward Modeling as Next-Token Prediction
Abstract
The paper proposes Generative Verifiers (GenRM), which recast verification as next-token prediction in large language model (LLM) reasoning domains. Key points:
- GenRM is a more performant alternative to discriminative reward models, and unlocks the use of powerful tools like chain-of-thought reasoning and majority voting for better verification.
- GenRM unifies generation and verification into a single LLM, and demonstrates that such unification benefits both generation and verification.
- GenRM can effectively utilize synthetic model-generated rationales, which are noisy and sub-optimal, to identify reasoning errors in grade school math problems.
Q&A
[01] Comparing GenRM with Prior Verification Approaches
1. How does GenRM compare to standard discriminative verifiers and other approaches on reasoning tasks?
- GenRM, which directly predicts a Yes/No token for verification, can match or outperform discriminative reward models (RMs) and other approaches such as LLM-as-a-Judge and self-consistency on algorithmic tasks like Last Letter Concatenation and Word Sorting, as well as on the GSM8K math reasoning task.
- GenRM-CoT, which combines chain-of-thought with majority voting, further improves the performance over direct GenRM.
- On GSM8K, GenRM-CoT consistently outperforms all other methods, even when using model-generated (rather than human-written) verification rationales.
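The direct-GenRM idea above can be sketched in a few lines: the verifier's score for a (problem, solution) pair is the probability the finetuned LLM assigns to a "Yes" token, normalized against "No". This is a minimal illustration, not the paper's implementation; `token_logprobs` is a hypothetical stand-in for one forward pass of the model, stubbed with fixed values so the sketch runs standalone.

```python
import math

def token_logprobs(prompt: str) -> dict:
    # Hypothetical model call returning next-token log-probs.
    # Stubbed with fixed values for illustration only.
    return {"Yes": math.log(0.8), "No": math.log(0.2)}

def genrm_score(problem: str, solution: str) -> float:
    """Score a solution as P(Yes), normalized over the Yes/No tokens."""
    prompt = f"{problem}\n{solution}\nIs this solution correct? Answer:"
    logprobs = token_logprobs(prompt)
    p_yes = math.exp(logprobs["Yes"])
    p_no = math.exp(logprobs["No"])
    return p_yes / (p_yes + p_no)

score = genrm_score("What is 2+3?", "2+3 = 5. The answer is 5.")
```

Because the score is a single next-token probability, verification reuses the same decoding machinery as generation, which is what lets one model serve both roles.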
2. How does GenRM's use of chain-of-thought reasoning and majority voting impact its performance?
- With oracle verification CoTs, GenRM-CoT closely matches the performance of an oracle verifier on the algorithmic tasks.
- On GSM8K, GenRM-CoT is able to detect subtle reasoning errors that are missed by discriminative verifiers, by leveraging the chain-of-thought rationales.
- Majority voting across multiple CoT rationales generated by GenRM-CoT further boosts its accuracy, allowing it to nearly match the performance of an oracle verifier on algorithmic tasks.
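The majority-voting step described above can be sketched as sampling K verification rationales and averaging the P(Yes) read off after each one. Here `sample_rationale_p_yes` is a hypothetical stand-in for one sampled CoT plus Yes/No readout, stubbed with noisy values so the sketch runs standalone.

```python
import random

def sample_rationale_p_yes(problem: str, solution: str, rng: random.Random) -> float:
    # Stub: noisy per-rationale verdicts centered on 0.7, for illustration.
    return min(1.0, max(0.0, 0.7 + rng.uniform(-0.2, 0.2)))

def genrm_cot_score(problem: str, solution: str, k: int = 32, seed: int = 0) -> float:
    """Average P(Yes) across K independently sampled CoT rationales."""
    rng = random.Random(seed)
    votes = [sample_rationale_p_yes(problem, solution, rng) for _ in range(k)]
    return sum(votes) / k
```

Averaging over rationales is what turns extra inference-time compute into better verification: individually noisy rationale verdicts cancel out in the mean.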
[02] Unifying Generation and Verification
1. How does unifying solution generation with verification impact GenRM's performance?
- Unifying solution generation with verification, as done by GenRM using the next-token-prediction objective, consistently improves verification performance across all tasks compared to training GenRM solely on verification data.
- Incorporating CoT verification data into the generator's training mix leads to better solution generation performance for the GenRM-CoT verifier itself.
- This suggests that teaching the verifier to imitate correct solutions through next-token prediction is mutually beneficial for both generation and verification.
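Because both verification targets and correct solutions are plain next-token-prediction targets, the unified training set can simply interleave the two. The sketch below shows one way to build such a mix; the field names and the mixing ratio `lambda_gen` are assumptions for illustration, not the paper's exact recipe.

```python
import random

def build_training_mix(verify_examples, solution_examples, lambda_gen=0.5, seed=0):
    """Interleave verification and generation data as next-token targets.

    verify_examples:   [{"prompt": ..., "yes_no": ...}, ...]   (hypothetical schema)
    solution_examples: [{"problem": ..., "solution": ...}, ...]
    lambda_gen: fraction of generation examples relative to verification data.
    """
    rng = random.Random(seed)
    mix = []
    for ex in verify_examples:
        mix.append({"input": ex["prompt"], "target": ex["yes_no"]})      # verification
    n_gen = int(lambda_gen * len(verify_examples))
    for ex in rng.sample(solution_examples, min(n_gen, len(solution_examples))):
        mix.append({"input": ex["problem"], "target": ex["solution"]})   # generation
    rng.shuffle(mix)
    return mix
```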
[03] Scaling Data, Model Size, and Inference-time Compute
1. How does GenRM-CoT's performance scale with increased inference-time compute?
- GenRM-CoT's performance scales gracefully with the number of CoT rationale samples used for majority voting, surpassing greedy-decoding performance within just 4 votes.
- Across different Gemma model scales (2B, 7B, 9B), the finetuned GenRM-CoT verifier outperforms the LLM-as-a-Judge approach, which also utilizes CoT and majority voting but with a more capable Gemini 1.0 Pro model.
2. How does GenRM's performance scale with increasing model size and training data?
- The performance of GenRM and GenRM-CoT verifiers scales positively with an increase in Gemma model capacity, matching the expectation that larger models can learn more from the same data under the next-token prediction loss.
- For GenRM-CoT on GSM8K, using multiple rationales per solution has a substantial positive effect on both RM accuracy and Best-of-N performance, suggesting the model benefits from the "ensembling" effect of training on noisy synthetic rationales.
- Direct GenRM verifiers trained only on verification data still outperform standard discriminative RMs as the amount of training data increases, demonstrating the effectiveness of casting verification as a next-token prediction problem.
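The Best-of-N metric referenced above has a simple shape: sample N candidate solutions from the generator and keep the one the verifier scores highest. The toy score function below is purely illustrative; in practice `score_fn` would be a direct GenRM or GenRM-CoT verifier returning P(correct).

```python
def best_of_n(problem, candidates, score_fn):
    """Return the candidate solution with the highest verifier score."""
    return max(candidates, key=lambda sol: score_fn(problem, sol))

# Usage with a toy score function (hypothetical scores for illustration):
cands = ["answer is 4", "answer is 5", "answer is 6"]
toy_scores = {"answer is 4": 0.2, "answer is 5": 0.9, "answer is 6": 0.4}
best = best_of_n("What is 2+3?", cands, lambda p, s: toy_scores[s])
```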
[04] Impact of Synthetic Rationale Quality
1. How does the quality of synthetic rationales impact GenRM-CoT's performance on GSM8K?
- Using reference-guided grading to generate the synthetic rationales significantly improves GenRM-CoT's performance on GSM8K compared to using unguided synthetic rationales.
- This indicates that LLMs are better able to identify reasoning errors when provided with a reference solution for comparison, even when using the same model (Gemini 1.0 Pro) to generate both the solutions and rationales.
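Reference-guided grading amounts to showing the rationale-writing model a known-correct reference solution next to the solution under verification. The prompt wording below is an assumption sketching the idea, not the paper's actual template.

```python
def reference_guided_prompt(problem: str, reference_solution: str,
                            candidate_solution: str) -> str:
    """Build a grading prompt that includes a known-correct reference.

    Hypothetical template: the exact wording used in the paper may differ.
    """
    return (
        f"Problem: {problem}\n"
        f"Reference solution: {reference_solution}\n"
        f"Candidate solution: {candidate_solution}\n"
        "Compare the candidate against the reference step by step, point out "
        "any reasoning errors, then answer: Is the candidate correct? Yes/No."
    )
```

Conditioning on the reference makes error-spotting a comparison task rather than an open-ended re-derivation, which is why the same model produces much cleaner rationales with it.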
© 2024 NewMotor Inc.